The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
We present sharper upper and lower bounds for a known polynomial-time approximation scheme due to Li, Ma and Wang [7] for the Consensus-Pattern problem. This NP-hard problem is an abstraction of motif finding, a common bioinformatics discovery task. The PTAS due to Li et al. is simple, and a preliminary implementation [8] gave reasonable results in practice. However, the previously known bounds on...
The longest common subsequence problem (LCS) and the closest substring problem (CSP) are two models for the finding of common patterns in strings. The two problem have been studied extensively. The former was previously proved to be not polynomial-time approximable within ratio nδ for a constant δ. The latter was previously proved to be NP-hard and have a PTAS....
In this paper we address the problem of constructing an index for a text document or a collection of documents to answer various questions about the occurrences of a pattern when allowing a constant number of errors. In particular, our index can be built to report all occurrences, all positions, or all documents where a pattern occurs in time linear in the size of the query string and the number of...
The compressed suffix array and the compressed suffix tree for a given string S are full-text index data structures occupying O(nlog|Σ|) bits where n is the length of S and Σ is the alphabet from which symbols of S are drawn. When they were first introduced, they were constructed from suffix arrays and suffix trees, which implies they were not constructed in optimal O(nlog|Σ|)-bit working space. Recently,...
A succinct full-text self-index is a data structure built on a text T=t1t2... tn, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P=p1p2... pm in T, and is able to reproduce any text...
The suffix array is a fundamental index data structure in string algorithms and bioinformatics, and the compressed suffix array (CSA) and theFM-index are its compressed versions. Many algorithms for constructing these index data structures have been developed. Recently, Hon et al. [11] proposed a construction algorithm using O(n ·loglog|Σ|) time and O(nlog|Σ|)-bit working space, which is the fastest...
We present new faster algorithms for the problems of δ and (δ, γ)-matching on numeric strings. In both cases the running time of the proposed algorithms is shown to be O(δn log m), where m is the pattern length, n is the text length and δ a given integer. Our approach makes use of Fourier transform methods and the running times are independent of the alphabet size. $O(n\sqrt{m\log{m}})$ algorithms...
Approximate string matching is a fundamental and challenging problem in computer science, for which a fast algorithm is highly demanded in many applications including text processing and DNA sequence analysis. In this paper, we present a fast algorithm for approximate string matching, called FAAST. It aims at solving a popular variant of the approximate string matching problem, the k-mismatch problem...
Approximate matching is one of the fundamental problems in pattern matching, and a ubiquitous problem in real applications. The Hamming distance is a simple and well studied example of approximate matching, motivated by typing, or noisy channels. Biological and image processing applications assign a different value to mismatches of different symbols. We consider the problem of approximate...
Indexing methods for the approximate string matching problem spend a considerable effort generating condensed neighborhoods. Here, we point out that condensed neighborhoods are not a minimal representation of a pattern neighborhood. We show that we can restrict our attention to super condensed neighborhoods which are minimal. We then present an algorithm for generating Super Condensed Neighborhoods...
In the median problem, we are given a distance or dissimilarity measure d, three genomes G1,G2, and G3, and we want to find a genome G (a median) such that the sum ∑ d(G,Gi) is minimized. The median problem is a special case of the multiple genome rearrangement problem, where one wants to find a phylogenetic tree...
Permutations on strings representing gene clusters on genomes have been studied earlier in [3, 12, 14, 17, 18] and the idea of a maximal permutation pattern was introduced in [12]. In this paper, we present a new tool for representation and detection of gene clusters in multiple genomes, using PQ trees [6]: this describes the inner structure and the relations between clusters succinctly, aids in filtering...
Speeding up approximate pattern matching is a line of research in stringology since the 80’s. Practically fast approaches belong to the class of filtration algorithms, in which text regions dissimilar to the pattern are excluded (filtered out) in a first step, and remaining regions are compared to the pattern by dynamic programming in a second step. Among the necessary conditions used to test similarity...
Weighted Directed Word Graph is a new text-indexing structure which is based on a new compaction method on DAWG. WDWGs are basically cyclic, means that they may accept infinite strings. But by assigning weights to the edges, the acceptable strings are limited only to the substrings of input strings. The size of WDWGs is smaller than that of DAWGs both in theory and practice. A linear-time on-line...
We present a new simple algorithm that constructs an Aho Corasick automaton for a set of patterns, P, of total length n, in O(n) time and space for integer alphabets. Processing a text of size m over an alphabet Σ with the automaton costs $O(m \log \left|\Sigma\right| + k)$ , where there are k occurrences of patterns in the text.
We introduce a generalization of the Burrows-Wheeler Transform (BWT) that can be applied to a multiset of words. The extended transformation, denoted by E, is reversible, but, differently from BWT, it is also surjective. The E transformation allows to give a definition of distance between two sequences, that we apply here to the problem of the whole mitochondrial genome phylogeny. Moreover we give...
Standard compression algorithms are not able to compress DNA sequences. Recently, new algorithms have been introduced specifically for this purpose, often using detection of long approximate repeats. In this paper, we present another algorithm, DNAPack, based on dynamic programming. In comparison with former existing programs, it compresses DNA slightly better, while the cost of dynamic programming...
Gene structure prediction is one of the most important problems in computational molecular biology. A combinatorial approach to the problem, denoted Gene Prediction via Spliced Alignment, was introduced by Gelfand, Mironov and Pevzner [5]. The method works by finding a set of blocks in a source genomic sequence S whose concatenation (splicing) fits a target gene T belonging to a homologous species...
Motif discovery is the problem of finding local patterns or motifs from a set of unlabeled sequences. One common representation of a motif is a Markov model known as a score matrix. Matrix based motif discovery has been extensively studied but no positive results have been known regarding its theoretical hardness. We present the first non-trivial upper bound on the complexity (worst-case computation...
In this paper we define a new class of problems that generalizes that of finding repeated motifs. The novelty lies in the addition of constraints on the motifs in terms of relations that must hold between pairs of elements of the motifs. For this class of problems we give an algorithm that is a suitable extension of the KMR [3] paradigm and, in particular, of the KMRC [7] as it uses a degenerate alphabet...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.